Mapping of Sequence Reads to the Reference Genomes ◾ 51
found including gene annotation in GFF/GTF file format, GenBank format, and tabular
format. The reference transcriptome (whole mRNA of an organism) and proteins may also
be available as shown in Figure 2.1.
To align (map) the reads produced by a sequencing instrument, we may
need to download a reference genome of the species from which the raw sequencing data
were obtained. The sequence of the reference genome must be in the FASTA file format. For
example, to download the FASTA file of the human genome, copy the link from the
“genome” hyperlink on the Genome database web page and, on a Linux terminal, use “wget”
to download the file to a directory of your choice (e.g., “refgenome”):
mkdir refgenome
wget \
  -O "refgenome/GRCh38.p13_ref.fna.gz" \
  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
This script will create the “refgenome” directory and download into it the compressed
FASTA sequence of the human reference genome, “GRCh38.p13_ref.fna.gz”. The compressed
FASTA file of the current human genome (GRCh38.p13) is only 921M. We can change into
the “refgenome” directory and decompress the file using the “gunzip” command.
cd refgenome
gunzip GRCh38.p13_ref.fna.gz
This command will decompress the reference genome file to “GRCh38.p13_ref.fna”, and
the file size is now 3.1G. A large text file can be displayed with Linux commands such as
“less” (page by page) or “cat” (all at once). The reference sequences are in the FASTA file
format. The file contains several sequences representing genomic units such as chromosomes.
Each FASTA sequence entry consists of two parts: a definition line (defline), which
is a single line that begins with the “>” symbol, and a sequence, which may span several lines.
Figure 2.2 shows the beginning of the human genome reference sequence. Notice that the
defline includes the RefSeq accession of the sequence, the scientific name of the species, the
genomic unit (chromosome number), and the human genome build. A chromosome sequence may
begin with multiple ambiguous bases (Ns). In Figure 2.2, we intentionally removed several
lines of Ns to show the DNA nucleobases.
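To make the defline/sequence structure concrete, the following is a minimal sketch on a toy file. The file name “toy.fna”, the record names, and the sequences are invented for illustration and are not taken from the human genome. It lists the deflines and then adds up the sequence lines of each record to report its length.

```shell
# Build a toy FASTA file; names and sequences are made up for illustration.
cat > toy.fna <<'EOF'
>seq1 toy chromosome 1
NNNNNNNNNN
ACGTACGTAC
>seq2 toy chromosome 2
GGCCTTAAGG
EOF

# Deflines are the lines that begin with ">".
grep '^>' toy.fna

# For each record, sum the lengths of the sequence lines that follow its defline.
lengths=$(awk '
  /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name, len }
' toy.fna)
echo "$lengths"
```

The same awk command should work unchanged on a real reference file, although on a genome-sized file it may take a few minutes.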
The following Unix/Linux commands are general-purpose tools for text files; here we
use them with FASTA files to collect some useful information.
To display the FASTA file content page by page, you can use the “less” command:
less GRCh38.p13_ref.fna
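When only the first few lines are of interest, a quick preview may be enough. The sketch below uses a small made-up file (“preview.fna” is a hypothetical name) to show that “gzip -dc” can stream a compressed FASTA file to “head” without writing the decompressed copy to disk.

```shell
# Build a small compressed FASTA file from invented toy data.
printf '>seq1 toy record\nNNNNACGT\nACGTACGT\n' > preview.fna
gzip -f preview.fna   # writes preview.fna.gz and removes preview.fna

# Decompress to standard output and keep only the first two lines;
# the .gz file on disk is left untouched.
first_lines=$(gzip -dc preview.fna.gz | head -n 2)
echo "$first_lines"
```

For a file that is already decompressed, “head -n 2” followed by the file name gives the same kind of preview.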
To count the number of FASTA sequences in the FASTA file, use the “grep” command:
grep -c ">" GRCh38.p13_ref.fna
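As a slightly stricter variant of the count, the pattern can be anchored to the start of the line with “^”, since a defline always begins with “>”. The following sketch, again on an invented toy file (“toy_count.fna” is a hypothetical name), counts the records and then extracts the identifier, which is assumed here to be the first word of each defline.

```shell
# Toy FASTA file with two records; the content is invented for illustration.
printf '>seqA first toy record\nACGT\nNNAC\n>seqB second toy record\nGGCC\n' > toy_count.fna

# Count the lines that start with ">", that is, the number of FASTA records.
n_records=$(grep -c '^>' toy_count.fna)
echo "Number of FASTA records: $n_records"

# Keep the first word of each defline and drop the leading ">".
ids=$(grep '^>' toy_count.fna | cut -d ' ' -f 1 | tr -d '>')
echo "$ids"
```

The pattern must be quoted in either form; an unquoted “>” would be interpreted by the shell as output redirection.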